Introduction

I chose this project because the topic concerns me greatly and saddens me. Hopefully it can shed some light on the subject. The World Health Organization (WHO) has reported that, somewhere in the world, a person dies by suicide every 40 seconds. Despite this alarmingly high figure, the WHO says only a handful of countries have policies aimed at suicide prevention. Source: https://www.who.int/


library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(gapminder)
library(ggthemes)
library(ggpubr)
## Loading required package: magrittr
library(cowplot)    
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggpubr':
## 
##     get_legend
## The following object is masked from 'package:ggthemes':
## 
##     theme_map
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
library(grid)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(gridExtra)    
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(viridisLite)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(DT) 


options(scipen=999)

Import Data and Problems

m<-read.csv("master.csv")
str(m)
## 'data.frame':    27820 obs. of  12 variables:
##  $ ï..country        : Factor w/ 101 levels "Albania","Antigua and Barbuda",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year              : int  1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
##  $ sex               : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 1 2 1 ...
##  $ age               : Factor w/ 6 levels "15-24 years",..: 1 3 1 6 2 6 3 2 5 4 ...
##  $ suicides_no       : int  21 16 14 1 9 1 6 4 1 0 ...
##  $ population        : int  312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
##  $ suicides.100k.pop : num  6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
##  $ country.year      : Factor w/ 2321 levels "Albania1987",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HDI.for.year      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gdp_for_year....  : Factor w/ 2321 levels "1,002,219,052,968",..: 727 727 727 727 727 727 727 727 727 727 ...
##  $ gdp_per_capita....: int  796 796 796 796 796 796 796 796 796 796 ...
##  $ generation        : Factor w/ 6 levels "Boomers","G.I. Generation",..: 3 6 3 2 1 2 6 1 2 3 ...
glimpse(m)
## Observations: 27,820
## Variables: 12
## $ ï..country         <fct> Albania, Albania, Albania, Albania, Albania...
## $ year               <int> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1...
## $ sex                <fct> male, male, female, male, male, female, fem...
## $ age                <fct> 15-24 years, 35-54 years, 15-24 years, 75+ ...
## $ suicides_no        <int> 21, 16, 14, 1, 9, 1, 6, 4, 1, 0, 0, 0, 2, 1...
## $ population         <int> 312900, 308000, 289700, 21800, 274300, 3560...
## $ suicides.100k.pop  <dbl> 6.71, 5.19, 4.83, 4.59, 3.28, 2.81, 2.15, 1...
## $ country.year       <fct> Albania1987, Albania1987, Albania1987, Alba...
## $ HDI.for.year       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ gdp_for_year....   <fct> "2,156,624,900", "2,156,624,900", "2,156,62...
## $ gdp_per_capita.... <int> 796, 796, 796, 796, 796, 796, 796, 796, 796...
## $ generation         <fct> Generation X, Silent, Generation X, G.I. Ge...
sum(complete.cases(m))
## [1] 8364
sum(is.na(m))
## [1] 19456
summary(m)
##        ï..country         year          sex                 age      
##  Austria    :  382   Min.   :1985   female:13910   15-24 years:4642  
##  Iceland    :  382   1st Qu.:1995   male  :13910   25-34 years:4642  
##  Mauritius  :  382   Median :2002                  35-54 years:4642  
##  Netherlands:  382   Mean   :2001                  5-14 years :4610  
##  Argentina  :  372   3rd Qu.:2008                  55-74 years:4642  
##  Belgium    :  372   Max.   :2016                  75+ years  :4642  
##  (Other)    :25548                                                   
##   suicides_no        population       suicides.100k.pop
##  Min.   :    0.0   Min.   :     278   Min.   :  0.00   
##  1st Qu.:    3.0   1st Qu.:   97498   1st Qu.:  0.92   
##  Median :   25.0   Median :  430150   Median :  5.99   
##  Mean   :  242.6   Mean   : 1844794   Mean   : 12.82   
##  3rd Qu.:  131.0   3rd Qu.: 1486143   3rd Qu.: 16.62   
##  Max.   :22338.0   Max.   :43805214   Max.   :224.97   
##                                                        
##       country.year    HDI.for.year            gdp_for_year....
##  Albania1987:   12   Min.   :0.483   1,002,219,052,968:   12  
##  Albania1988:   12   1st Qu.:0.713   1,011,797,457,139:   12  
##  Albania1989:   12   Median :0.779   1,016,418,229    :   12  
##  Albania1992:   12   Mean   :0.777   1,018,847,043,277:   12  
##  Albania1993:   12   3rd Qu.:0.855   1,022,191,296    :   12  
##  Albania1994:   12   Max.   :0.944   1,023,196,003,075:   12  
##  (Other)    :27748   NA's   :19456   (Other)          :27748  
##  gdp_per_capita....           generation  
##  Min.   :   251     Boomers        :4990  
##  1st Qu.:  3447     G.I. Generation:2744  
##  Median :  9372     Generation X   :6408  
##  Mean   : 16866     Generation Z   :1470  
##  3rd Qu.: 24874     Millenials     :5844  
##  Max.   :126352     Silent         :6364  
## 
names(m) 
##  [1] "ï..country"         "year"               "sex"               
##  [4] "age"                "suicides_no"        "population"        
##  [7] "suicides.100k.pop"  "country.year"       "HDI.for.year"      
## [10] "gdp_for_year...."   "gdp_per_capita...." "generation"
m <- rename(m, "country" = "ï..country", "gdp.c" = "gdp_per_capita....", "gdp.y" = "gdp_for_year....")
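
The odd ï..country name is most likely a UTF-8 byte-order mark at the start of the file being read as ordinary characters. A minimal sketch, assuming that is the cause: base R's read.csv() accepts fileEncoding = "UTF-8-BOM", which skips the mark on read. The temporary file and the demo name below are made up for illustration; they are not part of my data.

```r
# Hypothetical demo file that starts with a UTF-8 byte-order mark (EF BB BF)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("country,year\nAlbania,1987\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" tells R to skip the mark while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

If this works on the real file, the first column arrives as country and only the two gdp columns still need renaming.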

I started to clean the data, then thought: “I should probably check how many countries are in this set.”

select(m,country) %>% unique %>% nrow 
## [1] 101
unique(m$country)
##   [1] Albania                      Antigua and Barbuda         
##   [3] Argentina                    Armenia                     
##   [5] Aruba                        Australia                   
##   [7] Austria                      Azerbaijan                  
##   [9] Bahamas                      Bahrain                     
##  [11] Barbados                     Belarus                     
##  [13] Belgium                      Belize                      
##  [15] Bosnia and Herzegovina       Brazil                      
##  [17] Bulgaria                     Cabo Verde                  
##  [19] Canada                       Chile                       
##  [21] Colombia                     Costa Rica                  
##  [23] Croatia                      Cuba                        
##  [25] Cyprus                       Czech Republic              
##  [27] Denmark                      Dominica                    
##  [29] Ecuador                      El Salvador                 
##  [31] Estonia                      Fiji                        
##  [33] Finland                      France                      
##  [35] Georgia                      Germany                     
##  [37] Greece                       Grenada                     
##  [39] Guatemala                    Guyana                      
##  [41] Hungary                      Iceland                     
##  [43] Ireland                      Israel                      
##  [45] Italy                        Jamaica                     
##  [47] Japan                        Kazakhstan                  
##  [49] Kiribati                     Kuwait                      
##  [51] Kyrgyzstan                   Latvia                      
##  [53] Lithuania                    Luxembourg                  
##  [55] Macau                        Maldives                    
##  [57] Malta                        Mauritius                   
##  [59] Mexico                       Mongolia                    
##  [61] Montenegro                   Netherlands                 
##  [63] New Zealand                  Nicaragua                   
##  [65] Norway                       Oman                        
##  [67] Panama                       Paraguay                    
##  [69] Philippines                  Poland                      
##  [71] Portugal                     Puerto Rico                 
##  [73] Qatar                        Republic of Korea           
##  [75] Romania                      Russian Federation          
##  [77] Saint Kitts and Nevis        Saint Lucia                 
##  [79] Saint Vincent and Grenadines San Marino                  
##  [81] Serbia                       Seychelles                  
##  [83] Singapore                    Slovakia                    
##  [85] Slovenia                     South Africa                
##  [87] Spain                        Sri Lanka                   
##  [89] Suriname                     Sweden                      
##  [91] Switzerland                  Thailand                    
##  [93] Trinidad and Tobago          Turkey                      
##  [95] Turkmenistan                 Ukraine                     
##  [97] United Arab Emirates         United Kingdom              
##  [99] United States                Uruguay                     
## [101] Uzbekistan                  
## 101 Levels: Albania Antigua and Barbuda Argentina Armenia ... Uzbekistan

This data set is missing a lot: most importantly China and India, along with at least 90 other countries.

m1 <- m %>%
  group_by(country) %>%
  summarise(total_suicides = sum(suicides_no, na.rm = TRUE)) %>%
  arrange(desc(total_suicides))
m1
## # A tibble: 101 x 2
##    country            total_suicides
##    <fct>                       <int>
##  1 Russian Federation        1209742
##  2 United States             1034013
##  3 Japan                      806902
##  4 France                     329127
##  5 Ukraine                    319950
##  6 Germany                    291262
##  7 Republic of Korea          261730
##  8 Brazil                     226613
##  9 Poland                     139098
## 10 United Kingdom             136805
## # ... with 91 more rows

I decided to check my data set against another one. I chose the nations of the world data set, which I got from Kaggle.

nations <- read_csv("nations.csv")
## Parsed with column specification:
## cols(
##   iso2c = col_character(),
##   iso3c = col_character(),
##   country = col_character(),
##   year = col_double(),
##   gdp_percap = col_double(),
##   population = col_double(),
##   birth_rate = col_double(),
##   neonat_mortal_rate = col_double(),
##   region = col_character(),
##   income = col_character()
## )
select(nations,country) %>% unique %>% nrow
## [1] 211
unique(nations$country)
##   [1] "Andorra"                        "United Arab Emirates"          
##   [3] "Afghanistan"                    "Antigua and Barbuda"           
##   [5] "Albania"                        "Armenia"                       
##   [7] "Angola"                         "Argentina"                     
##   [9] "American Samoa"                 "Austria"                       
##  [11] "Australia"                      "Aruba"                         
##  [13] "Azerbaijan"                     "Bosnia and Herzegovina"        
##  [15] "Barbados"                       "Bangladesh"                    
##  [17] "Belgium"                        "Burkina Faso"                  
##  [19] "Bulgaria"                       "Bahrain"                       
##  [21] "Burundi"                        "Benin"                         
##  [23] "Bermuda"                        "Brunei Darussalam"             
##  [25] "Bolivia"                        "Brazil"                        
##  [27] "Bahamas, The"                   "Bhutan"                        
##  [29] "Botswana"                       "Belarus"                       
##  [31] "Belize"                         "Canada"                        
##  [33] "Congo, Dem. Rep."               "Central African Republic"      
##  [35] "Congo, Rep."                    "Switzerland"                   
##  [37] "Cote d'Ivoire"                  "Chile"                         
##  [39] "Cameroon"                       "China"                         
##  [41] "Colombia"                       "Costa Rica"                    
##  [43] "Cuba"                           "Curacao"                       
##  [45] "Cyprus"                         "Czech Republic"                
##  [47] "Germany"                        "Djibouti"                      
##  [49] "Denmark"                        "Dominica"                      
##  [51] "Dominican Republic"             "Algeria"                       
##  [53] "Ecuador"                        "Estonia"                       
##  [55] "Egypt, Arab Rep."               "Eritrea"                       
##  [57] "Spain"                          "Ethiopia"                      
##  [59] "Finland"                        "Fiji"                          
##  [61] "Micronesia, Fed. Sts."          "France"                        
##  [63] "Gabon"                          "United Kingdom"                
##  [65] "Grenada"                        "Georgia"                       
##  [67] "Ghana"                          "Gibraltar"                     
##  [69] "Greenland"                      "Gambia, The"                   
##  [71] "Guinea"                         "Equatorial Guinea"             
##  [73] "Greece"                         "Guatemala"                     
##  [75] "Guam"                           "Guinea-Bissau"                 
##  [77] "Guyana"                         "Hong Kong SAR, China"          
##  [79] "Honduras"                       "Croatia"                       
##  [81] "Haiti"                          "Hungary"                       
##  [83] "Indonesia"                      "Ireland"                       
##  [85] "Israel"                         "Isle of Man"                   
##  [87] "India"                          "Iraq"                          
##  [89] "Iran, Islamic Rep."             "Iceland"                       
##  [91] "Italy"                          "Channel Islands"               
##  [93] "Jamaica"                        "Jordan"                        
##  [95] "Japan"                          "Kenya"                         
##  [97] "Kyrgyz Republic"                "Cambodia"                      
##  [99] "Kiribati"                       "Comoros"                       
## [101] "St. Kitts and Nevis"            "Korea, Rep."                   
## [103] "Kuwait"                         "Cayman Islands"                
## [105] "Kazakhstan"                     "Lao PDR"                       
## [107] "Lebanon"                        "St. Lucia"                     
## [109] "Liechtenstein"                  "Sri Lanka"                     
## [111] "Liberia"                        "Lesotho"                       
## [113] "Lithuania"                      "Luxembourg"                    
## [115] "Latvia"                         "Libya"                         
## [117] "Morocco"                        "Monaco"                        
## [119] "Moldova"                        "Montenegro"                    
## [121] "St. Martin (French part)"       "Madagascar"                    
## [123] "Marshall Islands"               "Macedonia, FYR"                
## [125] "Mali"                           "Myanmar"                       
## [127] "Mongolia"                       "Macao SAR, China"              
## [129] "Northern Mariana Islands"       "Mauritania"                    
## [131] "Malta"                          "Mauritius"                     
## [133] "Maldives"                       "Malawi"                        
## [135] "Mexico"                         "Malaysia"                      
## [137] "Mozambique"                     "Namibia"                       
## [139] "New Caledonia"                  "Niger"                         
## [141] "Nigeria"                        "Nicaragua"                     
## [143] "Netherlands"                    "Norway"                        
## [145] "Nepal"                          "New Zealand"                   
## [147] "Oman"                           "Panama"                        
## [149] "Peru"                           "French Polynesia"              
## [151] "Papua New Guinea"               "Philippines"                   
## [153] "Pakistan"                       "Poland"                        
## [155] "Puerto Rico"                    "West Bank and Gaza"            
## [157] "Portugal"                       "Palau"                         
## [159] "Paraguay"                       "Qatar"                         
## [161] "Romania"                        "Serbia"                        
## [163] "Russian Federation"             "Rwanda"                        
## [165] "Saudi Arabia"                   "Solomon Islands"               
## [167] "Seychelles"                     "Sudan"                         
## [169] "Sweden"                         "Singapore"                     
## [171] "Slovenia"                       "Slovak Republic"               
## [173] "Sierra Leone"                   "San Marino"                    
## [175] "Senegal"                        "Somalia"                       
## [177] "Suriname"                       "South Sudan"                   
## [179] "Sao Tome and Principe"          "El Salvador"                   
## [181] "Sint Maarten (Dutch part)"      "Syrian Arab Republic"          
## [183] "Swaziland"                      "Turks and Caicos Islands"      
## [185] "Chad"                           "Togo"                          
## [187] "Thailand"                       "Tajikistan"                    
## [189] "Timor-Leste"                    "Turkmenistan"                  
## [191] "Tunisia"                        "Tonga"                         
## [193] "Turkey"                         "Trinidad and Tobago"           
## [195] "Tuvalu"                         "Tanzania"                      
## [197] "Ukraine"                        "Uganda"                        
## [199] "United States"                  "Uruguay"                       
## [201] "Uzbekistan"                     "St. Vincent and the Grenadines"
## [203] "Venezuela, RB"                  "Virgin Islands (U.S.)"         
## [205] "Vietnam"                        "Vanuatu"                       
## [207] "Samoa"                          "Yemen, Rep."                   
## [209] "South Africa"                   "Zambia"                        
## [211] "Zimbabwe"

Present problems

This data set had the reverse problem of the previous one: some entries are repeated, some refer to countries that no longer exist, and territories such as Guam and Gibraltar are included alongside sovereign states. So I finally went to the internet and just looked it up. There are 195 countries in the world, which break down as follows:

  • 54 in Africa
  • 48 in Asia
  • 44 in Europe
  • 33 in Latin America and the Caribbean
  • 14 in Oceania
  • 2 in Northern America

Source: https://www.worldometers.info/geography/how-many-countries-are-there-in-the-world/
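
Part of the gap between the two country lists is just spelling. A minimal sketch of how the lists could be compared, using a handful of names copied from the two listings above (the vectors suicide_names and nations_names are my own illustration, not objects from the analysis). Exact matching treats "Bahamas" and "Bahamas, The" as different, so spelling variants need to be separated from genuinely missing countries by hand.

```r
# Small hand-picked samples of names taken from the two listings above
suicide_names <- c("Albania", "Bahamas", "Republic of Korea", "Slovakia", "Japan")
nations_names <- c("Albania", "Bahamas, The", "Korea, Rep.", "Slovak Republic",
                   "Japan", "China", "India")

# Names that appear in both sets exactly as written
in_both <- intersect(suicide_names, nations_names)
in_both

# Names in the suicide data with no exact match in nations
only_suicide <- setdiff(suicide_names, nations_names)
only_suicide

# Names in nations with no exact match in the suicide data
only_nations <- setdiff(nations_names, suicide_names)
only_nations
```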

General problems with the data

  • 7 countries had fewer than 3 years of data in total.
  • 2016 had almost no countries, and those that were represented often had data missing.
  • HDI is missing for about 70% of rows (19,456 of 27,820).
  • The generation variable has problems (it is not ordinal).
  • Africa has very few countries providing suicide data.
  • Countries with large populations, such as China and India, are absent from the data.
  • There is a general lack of countries: only 101 out of 195.
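
A check like the first point can be sketched in a few lines of base R. This is only an illustration on an invented toy table (the toy data frame and its country labels are assumptions), not the exact code I ran on the real data.

```r
# Toy stand-in for the suicide data; countries and years are invented
toy <- data.frame(
  country = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  year    = c(2000, 2000, 2001, 2002, 2000, 2000, 2000, 2001, 2001)
)

# Number of distinct years reported by each country
years_per_country <- tapply(toy$year, toy$country, function(y) length(unique(y)))
years_per_country

# Countries with fewer than 3 years of data
sparse_countries <- names(years_per_country)[years_per_country < 3]
sparse_countries
```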

So quite naturally I took the high road and imported another data set.

who <- read_csv("who-suicide-statistics/who_suicide_statistics.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   sex = col_character(),
##   age = col_character(),
##   suicides_no = col_double(),
##   population = col_double()
## )
str(who)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 43776 obs. of  6 variables:
##  $ country    : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ year       : num  1985 1985 1985 1985 1985 ...
##  $ sex        : chr  "female" "female" "female" "female" ...
##  $ age        : chr  "15-24 years" "25-34 years" "35-54 years" "5-14 years" ...
##  $ suicides_no: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ population : num  277900 246800 267500 298300 138700 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country = col_character(),
##   ..   year = col_double(),
##   ..   sex = col_character(),
##   ..   age = col_character(),
##   ..   suicides_no = col_double(),
##   ..   population = col_double()
##   .. )
glimpse(who)
## Observations: 43,776
## Variables: 6
## $ country     <chr> "Albania", "Albania", "Albania", "Albania", "Alban...
## $ year        <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 19...
## $ sex         <chr> "female", "female", "female", "female", "female", ...
## $ age         <chr> "15-24 years", "25-34 years", "35-54 years", "5-14...
## $ suicides_no <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ population  <dbl> 277900, 246800, 267500, 298300, 138700, 34200, 301...
sum(complete.cases(who))
## [1] 36060
sum(is.na(who))
## [1] 7716

Already better: about 17.6% of the rows (7,716 of 43,776) contain an NA, compared with about 70% of the rows in the first data set.
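
The two NA counts above answer different questions: complete.cases() counts rows, while sum(is.na()) counts individual cells. A small sketch of the arithmetic, using only the numbers printed above (the variable names are mine):

```r
# Counts copied from the output above
who_rows     <- 43776   # rows in who
who_cols     <- 6       # columns in who
who_complete <- 36060   # sum(complete.cases(who))
who_na_cells <- 7716    # sum(is.na(who))

# Share of rows with at least one NA
row_share <- (who_rows - who_complete) / who_rows
# Share of individual cells that are NA
cell_share <- who_na_cells / (who_rows * who_cols)

round(100 * c(rows = row_share, cells = cell_share), 1)
```

Since the number of incomplete rows equals the number of NA cells here, no row is missing both suicides_no and population at once.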

summary(who)
##    country               year          sex                age           
##  Length:43776       Min.   :1979   Length:43776       Length:43776      
##  Class :character   1st Qu.:1990   Class :character   Class :character  
##  Mode  :character   Median :1999   Mode  :character   Mode  :character  
##                     Mean   :1999                                        
##                     3rd Qu.:2007                                        
##                     Max.   :2016                                        
##                                                                         
##   suicides_no        population      
##  Min.   :    0.0   Min.   :     259  
##  1st Qu.:    1.0   1st Qu.:   85113  
##  Median :   14.0   Median :  380655  
##  Mean   :  193.3   Mean   : 1664091  
##  3rd Qu.:   91.0   3rd Qu.: 1305698  
##  Max.   :22338.0   Max.   :43805214  
##  NA's   :2256      NA's   :5460
who1 <- who %>%
  group_by(country) %>%
  summarise(total_suicides = sum(suicides_no, na.rm = TRUE)) %>%
  arrange(desc(total_suicides))
head(who1)
## # A tibble: 6 x 2
##   country                  total_suicides
##   <chr>                             <dbl>
## 1 Russian Federation              1500992
## 2 United States of America        1201401
## 3 Japan                            937614
## 4 France                           395500
## 5 Ukraine                          365170
## 6 Germany                          291262

Cleaning the data

sapply(who, function(x) sum(is.na(x)))
##     country        year         sex         age suicides_no  population 
##           0           0           0           0        2256        5460

So we can see that the NAs occur only in the suicides_no and population variables.

na_df <- who[is.na(who$suicides_no) | is.na(who$population),]
nrow(na_df)
## [1] 7716

Handling NA values

In total, 7,716 rows contain missing values, which is 17.6% of the whole dataset. We will sort the NAs to find out whether the missing values for the two variables (population and suicides_no) are concentrated in a specific year or country. The point is to see whether the missingness is concentrated, which would introduce bias, or roughly random, which would have less of an overall effect.

na_population <- who[is.na(who$population),]

na_population$country <- factor(na_population$country, 
                                levels = unique(na_population$country))
na_population_by_country <- as.data.frame(table(na_population$country))
colnames(na_population_by_country) <- c('country', 'frequence')
# order data frame by decreasing frequency
na_population_by_country <- na_population_by_country[order(-na_population_by_country$frequence),]
# order factor so that we can plot in decreasing frequency
na_population_by_country$country <- factor(na_population_by_country$country, 
                                           levels = unique(na_population_by_country$country[
                                               order(-na_population_by_country$frequence, 
                                                     na_population_by_country$country)]))
# plotting na values of population by country in decreasing order
ggplot(data=na_population_by_country, aes(x=country, y=frequence, fill = country)) +
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.text.x=element_blank()) + ggtitle('NA Population Values per Country')

na_population$year <- factor(na_population$year, levels = unique(na_population$year))
na_population_by_year <- as.data.frame(table(na_population$year))
colnames(na_population_by_year) <- c('year', 'frequence')
na_population_by_year$year <- factor(na_population_by_year$year, 
                                     levels = 1978:2016, ordered = T)
ggplot(data=na_population_by_year, aes(x=year, y=frequence, fill = year)) +
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.text.x=element_blank()) + ggtitle('NA Population values per Year')

na_suicides <- who[is.na(who$suicides_no),]
# we remove levels from the country and year factor that are missing
na_suicides$country <- factor(na_suicides$country, levels = unique(na_suicides$country))
na_suicides_by_country <- as.data.frame(table(na_suicides$country))
colnames(na_suicides_by_country) <- c('country', 'frequence')
# order levels of countries depending on the missing rows

# order data frame by decreasing frequence
na_suicides_by_country <- na_suicides_by_country[order(-na_suicides_by_country$frequence),]
# order factor so that we can plot in decreasing frequency
na_suicides_by_country$country <- factor(na_suicides_by_country$country, 
                                         levels = unique(na_suicides_by_country$country[
                                             order(-na_suicides_by_country$frequence, 
                                                   na_suicides_by_country$country)]))


ggplot(data=na_suicides_by_country, aes(x=country, y=frequence, fill = country)) +
  geom_bar(stat="identity", width = 0.3) + 
  ggtitle('NA suicide_no values per Country') + 
  theme(axis.text.x=element_blank())

na_suicides$year <- factor(na_suicides$year, levels = unique(na_suicides$year))
na_suicides_by_year <- as.data.frame(table(na_suicides$year))
colnames(na_suicides_by_year) <- c('year', 'frequence')
na_suicides_by_year$year <- factor(na_suicides_by_year$year, 
                                   levels = 1978:2016, ordered = T)

ggplot(data=na_suicides_by_year, aes(x=year, y=frequence, fill = year)) +
  geom_bar(stat="identity", width = 0.3) + 
  ggtitle('NA suicide_no Values per Year') + 
  theme(axis.text.x=element_blank())

na_population_by_age <- as.data.frame(table(na_population$age))
na_population_by_age
##          Var1 Freq
## 1 15-24 years  910
## 2 25-34 years  910
## 3 35-54 years  910
## 4  5-14 years  910
## 5 55-74 years  910
## 6   75+ years  910
na_suicides_by_age <- as.data.frame(table(na_suicides$age))
na_suicides_by_age
##          Var1 Freq
## 1 15-24 years  376
## 2 25-34 years  376
## 3 35-54 years  376
## 4  5-14 years  376
## 5 55-74 years  376
## 6   75+ years  376

Results

Plotting the NA values by year and by country leads us to the conclusion that there is no particular connection between the missing data and these variables; the gaps look fairly random. However, there are countries whose groups always have at least one NA value (e.g. Peru does not have any registered population).

On the other hand, rows with NA population are spread equally across the age groups, and the same holds for rows with NA suicides_no. This suggests that when a value is missing, it is missing for every group of that country in that year (e.g. there is no case where the suicide number for Denmark is missing only for the 15-24 age group; instead, all age groups of that country for that year have NA for the suicides variable).
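
Equal counts per age group only hint at this, so it is worth checking directly. A minimal sketch on an invented toy table (toy, its countries, and the single missing country-year are assumptions): for each country-year, either no row or every row should be NA.

```r
# Toy stand-in for who: 2 countries x 2 years x 2 sexes x 6 age groups
ages <- c("5-14 years", "15-24 years", "25-34 years",
          "35-54 years", "55-74 years", "75+ years")
toy <- expand.grid(country = c("A", "B"), year = c(2000, 2001),
                   sex = c("female", "male"), age = ages,
                   stringsAsFactors = FALSE)
toy$suicides_no <- 1
# Invented gap: country A reported nothing in 2000
toy$suicides_no[toy$country == "A" & toy$year == 2000] <- NA

# Count missing rows and total rows within each country-year
group     <- paste(toy$country, toy$year)
n_missing <- tapply(is.na(toy$suicides_no), group, sum)
n_rows    <- tapply(toy$suicides_no, group, length)

# A group is "all or nothing" if it has either no NA or only NA
all_or_nothing <- n_missing == 0 | n_missing == n_rows
all(all_or_nothing)
```

On the real data, any country-year where all_or_nothing is FALSE would be a partially missing group and would contradict the conclusion above.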

who$suicides_no <- as.numeric(who$suicides_no)
who$population <- as.numeric(who$population)

Many countries are missing in any given year. For nearly half of the time period we examine, we have data for at most 100 of the 141 countries mentioned in total (so unfortunately not much better than our original data set). To handle these issues, there are different approaches we could consider:
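
The count of reporting countries per year can be sketched as follows. This uses an invented toy table rather than the real who data, and treats a country as reporting in a year only if both suicides_no and population are present (that definition is my assumption).

```r
# Toy stand-in for who; names and values are invented
toy <- data.frame(
  country     = c("A", "A", "B", "B", "C", "C"),
  year        = c(2000, 2001, 2000, 2001, 2000, 2001),
  suicides_no = c(5, 7, NA, 3, 2, 4),
  population  = c(100, 110, 200, 210, 300, 310)
)

# Keep rows where both values are present, then count distinct countries per year
reported <- toy[complete.cases(toy[, c("suicides_no", "population")]), ]
countries_per_year <- tapply(reported$country, reported$year,
                             function(x) length(unique(x)))
countries_per_year
```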

1. One idea would be to complete all the missing rows with NA values and then try to impute/predict all of them. The problem is that on some occasions a lot of data are missing, so predicting all of them would probably add a lot of bias.
2. Another idea would be to fill in the gaps by searching the WHO database (https://www.who.int/mental_health/prevention/suicide/countrydata/en/) and other internet resources.
3. A third would be to impute just the existing NA values using the MICE package.
4. The fourth would be to keep just the complete cases of the existing dataset.
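
As a rough illustration of the fourth option in the list above, here is a sketch on an invented toy table (toy and toy_complete are hypothetical names, not objects from my analysis):

```r
# Toy stand-in for who; values are invented
toy <- data.frame(
  country     = c("A", "A", "B", "B"),
  year        = c(2000, 2001, 2000, 2001),
  suicides_no = c(5, NA, 3, 4),
  population  = c(100, 110, NA, 210)
)

# Keep only rows with no missing values in any column
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)

# Totals per country now rest on complete rows only
aggregate(suicides_no ~ country, data = toy_complete, FUN = sum)
```

Note that sum(..., na.rm = TRUE) on the full table is not quite the same thing: it still counts rows whose suicides_no is present but whose population is missing.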

I think the fourth option is the most diplomatic, so that’s what I’m going to choose.

Exploration and visualizing

By Country

whoDF_s1 <- who %>% 
  group_by(country) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
  arrange(desc(total_suicides))

ggplot(whoDF_s1,aes(x=reorder(country,-total_suicides),y=total_suicides,fill=-total_suicides))+
  geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(x="Country",y="Count",title="Countries' Suicide Stats")+
  theme(plot.title = element_text(size=15,face="bold"))

By Year

whoDF_s3 <- who %>% 
  group_by(year,country) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
  arrange(desc(total_suicides))

ggplot(whoDF_s3,aes(x=year,y=total_suicides,fill=-total_suicides))+
  geom_col() +
  labs(x="Year",y="Count",title="Suicides Worldwide")+
  theme(plot.title = element_text(size=15,face="bold"))

Insights

  • Data are clearly missing at the end of the graph.
  • Data are also missing in the 80’s, because of NAs from Russia in that period.

By Age

whoDF_sa <- who %>% 
  group_by(age) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
  arrange(desc(total_suicides))

ggplot(whoDF_sa,aes(x=reorder(age,-total_suicides),y=total_suicides,fill=-total_suicides))+
  geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(x="Age",y="Count",title="Age Suicide Stats")+
  theme(plot.title = element_text(size=15,face="bold"))
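
The plot above sorts the age groups by count. To show them from youngest to oldest instead, the labels need explicit factor levels, because the default alphabetical order puts "5-14 years" in the middle. A minimal sketch (age_levels and age_ordered are names I made up for this example):

```r
# Age labels as they appear in the data
ages <- c("15-24 years", "25-34 years", "35-54 years",
          "5-14 years", "55-74 years", "75+ years")

# Default factor levels follow alphabetical sorting, not age
levels(factor(ages))

# Explicit levels give the natural youngest-to-oldest order
age_levels <- c("5-14 years", "15-24 years", "25-34 years",
                "35-54 years", "55-74 years", "75+ years")
age_ordered <- factor(ages, levels = age_levels, ordered = TRUE)
levels(age_ordered)
```

Passing age_ordered (or factor(age, levels = age_levels)) as the x aesthetic would keep the bars in age order.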

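We can clearly observe that the highest number of suicides occurs in middle age. This surprised me, because I assumed it would be highest among the elderly or teenagers. Keep in mind that these are raw counts, and the 35-54 and 55-74 bands are twice as wide as the younger ones.

By Gender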

whoDF_sb <- who %>% 
  group_by(sex) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
  arrange(desc(total_suicides))

ggplot(whoDF_sb,aes(x=reorder(sex,-total_suicides),y=total_suicides,fill=-total_suicides))+
  geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(x="Gender",y="Count",title="Suicide by Gender")+
  theme(plot.title = element_text(size=15,face="bold"))

t.test(suicides_no ~ sex , data = who , alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  suicides_no by sex
## t = -26.091, df = 24030, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -190.5464
## sample estimates:
## mean in group female   mean in group male 
##              91.6316             294.9992

The p-value is below 2.2e-16, so the difference is statistically significant. We can therefore say that, in this data set, recorded suicide counts are significantly higher for males than for females over this time period.
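
The direction of the one-sided test depends on the order of the groups. A small sketch on invented counts (the `toy` data frame is hypothetical) shows how `alternative = "less"` is read when the grouping variable is `sex`:

```r
# Hypothetical counts, invented for illustration only.
toy <- data.frame(
  sex         = rep(c("female", "male"), each = 6),
  suicides_no = c(10, 12, 9, 11, 13, 8, 30, 35, 28, 40, 33, 31)
)

# With a formula, t.test() orders the groups by factor level (alphabetical
# here: female, then male), so "less" tests mean(female) - mean(male) < 0.
res <- t.test(suicides_no ~ sex, data = toy, alternative = "less")

res$estimate  # group means, female first
res$p.value   # one-sided p-value
```

Note that a test like this compares counts per record, not rates per population, so it says nothing about population size differences between groups.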

This may be explained by what Simon Haber, vice-chair of research for the Department of Psychiatry at the University of Ottawa, says: “Women are actually more likely to try to kill themselves - three to four times more likely. But men are more likely to die from it. That’s a pattern that holds true across Canada, and in most of the rest of the world as well. That’s mainly due to two things: One is that men use more lethal means [to attempt suicide], and the second is that they don’t seek care as much.”

Insights

  • Globally, the suicide rate for men has been ~3.5x higher than for women
  • Both male & female suicide rates peaked in 1995, declined afterwards, and rose slightly in the late 2000’s
  • This ratio of 3.5 : 1 (male : female) has remained relatively constant since the mid 90’s
  • However, during the 80’s this ratio was as low as 2.7 : 1 (male : female)
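
A ratio like the ones in the list above can be computed from counts and population. The sketch below is one possible way to do it, assuming `year`, `sex`, `suicides_no` and `population` columns as in `who`. The `toy` values are invented and were chosen only so that the ratios are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration.
toy <- data.frame(
  year        = c(1985, 1985, 1995, 1995),
  sex         = c("female", "male", "female", "male"),
  suicides_no = c(100, 270, 120, 420),
  population  = c(1e6, 1e6, 1e6, 1e6)
)

ratio_by_year <- toy %>%
  group_by(year, sex) %>%
  # Rate per 100,000 people for each sex in each year
  summarise(rate = sum(suicides_no) / sum(population) * 1e5, .groups = "drop") %>%
  group_by(year) %>%
  # Male rate divided by female rate within each year
  summarise(male_to_female = rate[sex == "male"] / rate[sex == "female"])

print(ratio_by_year)
```

Using rates instead of raw counts keeps the ratio comparable across years even when the reporting population changes.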

Comparing the nations with the most suicides by raw count.

whoDF_s6 <- who %>% 
  filter(country == "Russian Federation",year%in%c("1997","2015","1980")) %>%
  group_by(sex,age,year) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE))

ggplot(whoDF_s6,aes(x=factor(age,levels = c("5-14 years","15-24 years","25-34 years","35-54 years","55-74 years","75+ years")),y=total_suicides,fill=sex))+
  geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  facet_wrap(~ year) +
  labs(x="Age",y="Count",title="Suicides in Russia")+
  theme(plot.title = element_text(size=15,face="bold"))

whoDF_s7 <- who %>% 
  filter(country == "United States of America",year%in%c("1997","2015","1980")) %>%
  group_by(sex,age,year) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE))

ggplot(whoDF_s7,aes(x=factor(age,levels = c("5-14 years","15-24 years","25-34 years","35-54 years","55-74 years","75+ years")),y=total_suicides,fill=sex))+
  geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  facet_wrap(~ year) +
  labs(x="Age",y="Count",title="Suicides in USA")+
  theme(plot.title = element_text(size=15,face="bold"))

In short, suicides in Russia have gone down, while in the USA they have gone up.

TC <- who %>% 
  select(country, year, sex, age, suicides_no, population) %>%
  filter(country %in% c("Russian Federation","United States of America","Japan","France","Ukraine","Germany","Republic of Korea","Brazil","Poland","United Kingdom" ))
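
The country list above is typed by hand. As a sketch, assuming the intent is to keep the countries with the largest totals, the list could instead be derived from the data. The `toy` data frame is hypothetical, and `head(2)` stands in for `head(10)` on the real data.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration.
toy <- data.frame(
  country     = c("A", "A", "B", "B", "C", "C", "D"),
  suicides_no = c(50, 60, 500, NA, 5, 7, 300)
)

# Rank countries by total count and keep the names of the top ones.
top_n_countries <- toy %>%
  group_by(country) %>%
  summarise(total_suicides = sum(suicides_no, na.rm = TRUE)) %>%
  arrange(desc(total_suicides)) %>%
  head(2) %>%          # would be head(10) on the real data
  pull(country)

# Keep only the rows that belong to those countries.
toy_top <- toy %>% filter(country %in% top_n_countries)

print(top_n_countries)
```

Deriving the list this way keeps the subset in step with the data if records are added or cleaned later.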

I wanted to see which countries had the most suicides in this data set by sheer numbers.

whoDF_s8 <- TC%>% 
  group_by(country) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
  arrange(desc(total_suicides))

ggplot(whoDF_s8,aes(x=reorder(country,-total_suicides),y=total_suicides,fill=-total_suicides))+
  geom_col() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  labs(x="Country",y="Count",title="Top Ten Countries")+
  theme(plot.title = element_text(size=18,face="bold"))

It’s kind of a hodgepodge. I guess Russia, Japan, and the USA make sense, but the rest of them (besides the European aspect) have very little in common. This breaks apart the idea that it’s because of a single culture, economy, or code of ethics. Everyone can point a finger at Japan and say “it’s very pressured there”, but that is a micro, not a macro, observation.

# The ten countries with the largest suicide counts
top10_countries <- c("Russian Federation","United States of America","Japan","France","Ukraine","Germany","Republic of Korea","Brazil","United Kingdom","Poland")
df_top10 <- who %>% 
  filter(country %in% top10_countries)

# Seven of those countries (Russian Federation, United States of America, and Poland excluded)
df7 <- who %>%
  filter(country %in% c("Japan","France","Ukraine","Germany","Republic of Korea","Brazil","United Kingdom"))

whoDF_10 <- df_top10 %>% 
  group_by(year,country) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
  arrange(desc(total_suicides))

ggplot(whoDF_10,mapping = aes(x=year, y=total_suicides, colour = country)) +geom_line(aes(linetype = country))

This shows where the NAs kick in.
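
One detail worth keeping in mind when reading the line plot: `sum(..., na.rm=TRUE)` returns 0 for a group in which every value is missing, so a year with no data can look like a year with zero suicides. The sketch below shows this behaviour, plus a hypothetical helper `sum_or_na()` (not part of the original analysis) that keeps such groups as `NA` instead.

```r
# An all-NA group sums to 0 when na.rm = TRUE.
sum(c(NA, NA), na.rm = TRUE)

# Hypothetical helper: return NA when every value in the group is missing,
# otherwise the usual sum with missing values removed.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

sum_or_na(c(NA, NA))     # stays missing
sum_or_na(c(10, NA, 5))  # partial data is still summed
```

With a helper like this inside `summarise()`, `geom_line()` would leave a gap for missing years rather than drawing a drop to zero.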

hop <- df7 %>% 
  group_by(year,country) %>% 
  summarise(total_suicides = sum(suicides_no,na.rm=TRUE)) %>%
  arrange(desc(total_suicides))

ggplot(data = hop, mapping = aes(x =country , y = total_suicides)) + 
  geom_boxplot() +
  coord_flip()

This is me taking out the top three and trying to find a pattern.

Final Visualization

df_country_group <- who %>%
group_by(country) %>%
summarise(sumsui= sum( as.numeric(suicides_no)),popsum = sum(as.numeric(population)))
df_country_group <- na.omit(df_country_group)
df_country_group_ratio<- df_country_group %>% mutate(ratioSP =sumsui/popsum)
df_country_group_ratio <- df_country_group_ratio %>% arrange(-ratioSP)
df_country_group_ratio_top <- head(df_country_group_ratio,10)

For this I first wanted to show something statistical. Instead of just listing the countries with the biggest numbers, I computed a ratio showing which countries proportionally have the most suicides.
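
The raw ratio is a very small number, so it can be easier to read as a rate per 100,000 people. The sketch below uses a hypothetical `toy` data frame with the same column names as `df_country_group`. The values are invented, and `rate_per_100k` is a name introduced here for illustration.

```r
# Hypothetical totals for three made-up countries.
toy <- data.frame(
  country = c("A", "B", "C"),
  sumsui  = c(5000, 120000, 900),
  popsum  = c(2e7, 1e9, 1.5e6)
)

# Same ratio as in the analysis, then rescaled to "per 100,000 people".
toy$ratioSP       <- toy$sumsui / toy$popsum
toy$rate_per_100k <- toy$ratioSP * 1e5

# Country B has by far the largest count but the lowest rate.
toy[order(-toy$rate_per_100k), ]
```

Because both sums run over every year in the data, the ratio is closer to an average yearly rate than to a share of the population at any one time.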

 df_country_group_ratio_top1 <- df_country_group_ratio_top %>%
   arrange(ratioSP)
ggplot(df_country_group_ratio_top1,aes(x = ratioSP,   y = country )) +
  scale_y_discrete(limits= df_country_group_ratio_top1$country) +
 geom_segment( aes(xend=0,yend=country),size = 1,color='red2')+
  geom_point(fill="red2",color="green",size=4,shape=21,stroke=2) +
  ggtitle("Countries with highest suicide to population ratio")+
   labs(x="Ratio : Suicide/Population", y="Country")

Made it Interactive.

df_country_group_ratio_top1 <- df_country_group_ratio_top %>%
    arrange(ratioSP)
    
ggplotly(
    ggplot(df_country_group_ratio_top1,aes(x = ratioSP,   y = country )) +
    scale_y_discrete(limits= df_country_group_ratio_top1$country) +
    geom_segment( aes(xend=0,yend=country),size = 1,color='red2')+
    geom_point(fill="red2",color="red4",size=4,shape=21,stroke=2) +
    #ggtitle("Countries with highest suicide to population ratio")+
    labs(x="Ratio : Suicide/Population", y="Country")
)

Final Thoughts

Based on the 2016 National Survey of Drug Use and Mental Health, it is estimated that 0.5 percent of adults aged 18 or older made at least one suicide attempt. This translates to approximately 1.3 million adults. Adult females reported a suicide attempt 1.2 times as often as males. Further breakdowns by gender and race are not available. This data set needed more variables. I was surprised that middle-aged people are the most prone to suicide. In conclusion, suicide is an epidemic with a large number of causes but a dearth of solutions. There needs to be more research conducted on the subject and fewer preconceived notions.

Essay

I chose this data set primarily because suicide is one of those areas of suffering that, while being researched, we have yet to get a definite scientific handle on. Part of the reason is that there is no single clear cause of suicide. Directly related to this is the mix of numerous reasons, and the degree of each, that can bring someone to the point of committing suicide. Human beings are complex creatures who act and are acted upon for a host of dependent and independent reasons, all the while interfacing with a world that can frankly be cruel at times. Ironically, the biggest hindrance is the very thing that gives us our understanding. The human mind is as multivariate and complex as any weather pattern, as powerful as any supercomputer, and as scattered and seemingly random as the billions of causes and reasons behind man’s daily interaction with his world. All of what I have stated above leads to a situation in which people can point fingers at stereotypical and false outliers and say “that’s what causes suicide”. What I wanted to show, at the very least, with this data set was the truly global nature of suicide and how it affects all races, groups, and countries regardless of socioeconomic standing. This is what the New York Times reported in 2016: “When it comes to suicide and suicide attempts there are rate differences depending on demographic characteristics such as age, gender, ethnicity and race. Nonetheless, suicide occurs in all demographic groups.” Part of the problem is classification, as the WHO reports: “In 2015, 505,507 people visited a hospital for injuries due to self-harm. This number suggests that for every reported suicide death, approximately 11.4 people visit a hospital for self-harm related injuries. However, because of the way these data are collected, we are not able to distinguish intentional suicide attempts from non-intentional self-harm behaviors.”

In my project there were many obstacles to overcome, such as figuring out how the NAs interfered with the plots. Another difficulty was just figuring out how to group everything. I would say my biggest challenge was mutating the data for the last visualization to add a new variable giving the real ratio of suicides to population in a country. My smallest challenge was accidentally deleting my project this morning and rushing to get it done by the deadline (“still have 20 minutes”, 2:44). When cleaning the data, I did consider trying to retrieve the information, but it proved too cumbersome. My interactive map met the same fate; I really wish I could have executed that properly. It’s sad, partially because this data set was a bit bare-bones, so it sort of confirmed my preconditioned biases. The truth is there are some clear indicators of suicide; they’re just often shaded by other unseen variables. I was surprised by how global suicide is: like other human things, it tragically knows no boundary, country, or culture. The age finding also really shocked me; who would have thought middle-aged people would be the most prone to suicide? In conclusion, I hope that there will be more research conducted about suicide and that one day we will be rid of this very human curse.

Bibliography

New York Times.

United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/